{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started with Noun Phrase to Vec (NP2vec)\n",
    "\n",
    "Noun Phrases (NP) play a particular role in NLP applications. \n",
    "This code consists in training a word embedding’s model for Noun NP’s using word2vec or fastText algorithm. All the terms in the corpus are used as context in order to train the word embedding’s model; however, at the end of the training, only the word embedding’s of the NP’s are stored, except for the case of fastText training with word_ngrams=1; in this case, we store all the word embedding’s, including non-NP’s in order to be able to estimate word embeddings of out-of-vocabulary NP’s (NP’s that don’t appear in the training corpora).\n",
    "\n",
    "This tutorial shows how to train an NP2vec model.\n",
    "\n",
    "First let’s install NLP Architect and wget libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/IntelLabs/nlp-architect.git\n",
    "!pip install nlp-architect/\n",
    "%cd nlp-architect\n",
    "!pip install wget"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's import relevant libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "dhrGFfcMMnS8"
   },
   "outputs": [],
   "source": [
    "import gzip\n",
    "import wget\n",
    "\n",
    "from nlp_architect.models.np2vec import NP2vec\n",
    "from solutions.set_expansion.prepare_data import load_parser, mark_noun_phrases"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's download a sample corpus (a subset of English Wikipedia dump)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'https://github.com/IntelLabs/nlp-architect/raw/master/datasets/wikipedia/enwiki-20171201_subset.txt.gz'  \n",
    "wget.download(url, 'enwiki-20171201_subset.txt.gz')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's extract NP's from the corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus='enwiki-20171201_subset.txt.gz'\n",
    "marked_corpus = 'enwiki-20171201_subset_marked.txt'\n",
    "chunker = 'spacy'\n",
    "with gzip.open(corpus, 'rt', encoding='utf8', errors='ignore') as corpus_file, open(marked_corpus, 'w', encoding='utf8') as marked_corpus_file:\n",
    "    nlp = load_parser(chunker)\n",
    "    num_lines = sum(1 for line in corpus_file)\n",
    "    corpus_file.seek(0)\n",
    "    print('%i lines in corpus', num_lines)\n",
    "    mark_noun_phrases(corpus_file, marked_corpus_file, nlp, num_lines, chunker)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's train the NP2vec model and store it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np2vec = NP2vec(marked_corpus)\n",
    "np2vec.save()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The NP2vec model can be used for term set expansion (Term Set Expansion Jupyter notebook available at nlp-architect/tutorials/Term_Set_Expansion/term_set_expansion.ipynb)."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "NP2vec_training.ipynb",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
